The chloroplasts genomic analyses of four specific Caragana species

Background Many species of the genus Caragana are used as wind-prevention and sand-fixation plants. They are also important resource plants for traditional Chinese medicine and ethnic medicine. The chloroplast genomes (cp-genomes) of these important species therefore merit study. Methods In this study, we analyzed the chloroplast genomes of C. jubata, C. erinacea, C. opulens, and C. bicolor, including their structure, repeat sequences, mutation sites, and phylogeny. Results The chloroplast genomes ranged from 127,862 to 132,780 bp in size and contained 112 genes (30 tRNA, 4 rRNA, and 78 protein-coding genes), 43 of which were photosynthesis-related. The total guanine + cytosine (G+C) content of the four Caragana species was between 34.49% and 35.15%. All four Caragana species lacked the inverted repeat (IR) region and can be assigned to the inverted repeat-lacking clade (IRLC). Among the annotated genes of the four chloroplast genomes, 17 contained introns, most of them a single intron. A total of 50 interspersed repeated sequences (IRSs) were found among the four genomes, and 58, 29, 61, and 74 simple sequence repeats were found in C. jubata, C. bicolor, C. opulens, and C. erinacea, respectively. Analyses of sequence divergence showed that some intergenic regions (between trnK-UUU and rbcL; trnF-GAA and ndhJ; trnL-CAA and trnT-UGU; rpoB and trnC-GCA; petA and psbL; and psbE and petL) and the sequences of rpoC, ycf1, and ycf2 exhibited a high degree of variation. A phylogenetic tree of eight Caragana species and another 10 legume species was reconstructed using full sequences of the chloroplast genome. Conclusions (1) Chloroplast genomes can be used for the identification and classification of Caragana species. (2) The four Caragana species have highly similar cpDNA G+C content. (3) Analysis of the chloroplast genomes showed that these four species, like most species of the legume IRLC, have lost the IR region. (4) Comparative cp-genomic analysis suggested that the cp-genome structure of the genus Caragana was well conserved; the highly variable regions can be used to develop markers for the identification of Caragana species and for further phylogenetic study. (5) Results of the phylogenetic analyses were in accordance with the current taxonomic status of Caragana. The phylogenetic relationship of Caragana species was partially consistent with elevation and geographical distribution.


Introduction
About 100 Caragana species exist worldwide. They occur in the arid and semi-arid regions of Asia and Europe, and 66 species (32 endemic) can be found in China [1]. Members of the genus Caragana are deciduous shrubs with a wide range of adaptability and strong stress tolerance. Most Caragana species are distributed at higher elevations and in relatively harsh environments (barren, dry, hot, and cold); they are known for wind prevention and sand fixation [2][3][4][5]. Moreover, previous studies have shown that many plants of the genus have pharmacological activities, including antibacterial, antioxidant, anti-tumor, and anti-HIV effects [2,5,6]. All four species in this study have been documented in traditional Chinese ethnic medicine. Among them, C. jubata is an important Tibetan drug that can be used to treat alpine erythrocytosis and hypertension; it possesses hepatoprotective and antiviral activities [7][8][9][10][11].
Phylogenetic relationships in the genus Caragana remain obscure, and the identification of medicinal species is also problematic. Only 10 cp-genomes of the genus Caragana have been reported, so few data are available for analysis [12][13][14].
Chloroplasts are descendants of ancient cyanobacterial endosymbionts. They are characteristic organelles of green plants and play an indispensable role in photosynthesis [15]. In angiosperms, the chloroplast genome is ordinarily inherited maternally [16]. The chloroplast genome is relatively stable in structure, and it typically contains a large single-copy region, a small single-copy region, and two inverted repeat (IR) regions. An inverted repeat-lacking clade (IRLC), whose members have lost one copy of the IR, has been reported in legumes [17-20], and four Caragana species have been reported to belong to the IRLC [11][12][13]. Therefore, the genus Caragana can represent a lineage with extensive IR loss. However, knowledge of the pattern, origin, and evolution of IR loss within Caragana is presently limited by the scarcity of plastome sequences. In addition, chloroplast genomes can be used to study molecular identification, phylogeny, species conservation, and evolution [21,22].
In the present study, four species of the genus Caragana from Ganzi Tibetan Autonomous Prefecture of Sichuan Province and from Qinghai Province, China, were identified on the basis of their chloroplast genomes. Their structural characteristics, population genetics, and phylogenetic relationships were documented.
The cp reads were assembled with SPAdes, ABySS, and SOAPdenovo. All of the contigs were aligned to the reference cp genome of C. korshinskii with MUMmer. Finally, gaps in the assembly were closed with GapCloser-1.12 (OMEGA) [24].

Gene annotation and sequence analyses
Sequences were annotated with Plann [25] using the chloroplast genome of C. korshinskii from NCBI as the reference, followed by manual correction. BLAST and Apollo [26] were used to check the start and stop codons and the intron/exon boundaries, with the cp genome of C. korshinskii as the reference sequence. The complete chloroplast genome sequence data reported in this paper have been deposited in the Genome Warehouse of the National Genomics Data Center (NGDC, https://ngdc.cncb.ac.cn/; accession numbers: GWHBJYO00000000, GWHBJYN00000000, GWHBJYM00000000, and GWHBJYL00000000). Maps of the structural features of the chloroplast genomes were drawn with Organellar Genome DRAW [27] (http://ogdraw.mpimp-golm.mpg.de/). Protein-coding gene sequences were extracted with Geneious.

Comparison of chloroplast genomes
The chloroplast genomes of the Caragana species were compared using mVISTA [28] (Shuffle-LAGAN mode) with the genome of C. korshinskii as the reference. Forward, palindromic, and tandem repeats were detected using Tandem Repeats Finder [26] and REPuter [27]. In addition, simple sequence repeats (SSRs) were detected using Misa.pl [29]. The search parameters for mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides were set to ≥10, ≥8, ≥4, and ≥3 repeat units.
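The SSR search described above can be illustrated with a minimal Python sketch. It is not the Misa.pl implementation; it only finds perfect repeats with a regular expression. The text gives four thresholds for six motif classes, so the mapping used here (10 for mono-, 8 for di-, 4 for tri-, and 3 for tetra-, penta-, and hexanucleotides) is an assumption, and the name `find_ssrs` is hypothetical.

```python
import re

# ASSUMPTION: the text lists four thresholds for six motif classes; the
# mapping of the last threshold to tetra-, penta-, and hexanucleotides is ours.
MIN_REPEATS = {1: 10, 2: 8, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Start positions are 0-based. Motifs such as "AA" or "ATAT" are skipped
    so that a mononucleotide run is not also reported as a dinucleotide SSR.
    """
    seq = seq.upper()
    hits = []
    for k, n in min_repeats.items():
        # One motif of length k followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    demo = "GGG" + "A" * 12 + "C" + "AT" * 8 + "G" + "AGC" * 4 + "T"
    for start, motif, count in find_ssrs(demo):
        print(start, motif, count)
```

Compound and imperfect SSRs, which dedicated tools can report, are outside the scope of this sketch.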

Phylogenetic analyses
Phylogenetic trees were constructed using the plastid genomes of 18 species, with Ranitomeya imitator as the outgroup. The sequences were aligned with MAFFT. An unrooted phylogenetic tree with 1000 bootstrap replicates was inferred using the neighbor-joining (NJ) approach in MEGA X [30].
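Neighbor-joining works from a matrix of pairwise distances between aligned sequences. The sketch below computes one simple choice, the uncorrected p-distance, from an alignment held in a dictionary. The distance model actually used in MEGA X is not stated in the text, so p-distance is an assumption here, and the function names are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap ("-") or an ambiguous base ("N") in either
    sequence are ignored (pairwise deletion).
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every ordered pair of names."""
    names = list(alignment)
    return {
        (p, q): p_distance(alignment[p], alignment[q])
        for p in names
        for q in names
    }


if __name__ == "__main__":
    # Toy alignment; the labels are placeholders, not real data.
    toy = {"sp1": "ACGT-A", "sp2": "ACGA-T", "sp3": "ACGT-A"}
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 3))
```

A tree-building step would then take this matrix as input; bootstrap support is obtained by resampling alignment columns and repeating the procedure.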

DNA features of the chloroplast genome of four Caragana species
The size of the chloroplast genomes was between 127,862 and 132,780 bp, which is small because of the loss of the IR region. C. opulens (132,780 bp) had the largest chloroplast genome, whereas C. jubata (127,862 bp) had the smallest. The mean total guanine + cytosine (G+C) content of the four Caragana species was 34.72%. The four Caragana species had chloroplast genomes with a similar structure, and all of them had lost the IR region. After annotation, the whole chloroplast genome sequences of the four Caragana species were submitted to NGDC; the accession numbers are listed in Table 1.
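The G+C content reported above is the percentage of G and C bases among all unambiguous bases of a genome sequence. A minimal Python sketch of that calculation follows; the function name `gc_content` is hypothetical and the demo string is a toy sequence, not one of the genomes studied here.

```python
def gc_content(seq):
    """Return the G+C percentage of a DNA sequence.

    Only A, C, G, and T are counted, so ambiguous bases such as N do not
    affect the result.
    """
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / total


if __name__ == "__main__":
    print(gc_content("ATGCATGCAT"))
```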

Analyses of long repetitive sequences and SSRs
For C. jubata, C. bicolor, C. erinacea, and C. opulens, interspersed repeated sequences (IRSs) with a repeat-unit length of ≥ 20 bp were evaluated in the chloroplast genomes. These sequences comprised only forward, reverse, and palindromic repeats; they lacked the complementary repeats that are common in other species. A total of 50 IRSs were found. Across all types of IRS, sequence lengths of 20-39 bp occurred most frequently in C. jubata, and lengths of 40-59 bp occurred most frequently in C. erinacea. Lengths of 60-79 bp and ≥ 100 bp occurred most frequently in C. opulens. IRS analyses of the chloroplast genomes are shown in Fig 2. The key mutational mechanism generating SSR polymorphism is slipped-strand mispairing [32]. SSRs in chloroplast genomes are often used as genetic markers in evolutionary and population genetic studies because of their variability at the intra-specific level [33, 34]. We found 58 SSRs in C. jubata, 29 SSRs in C. bicolor, 61 SSRs in C. opulens, and 74 SSRs in C. erinacea (Fig 3).
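The four repeat categories named above differ only in how the second copy relates to the first: identical (forward), reversed (reverse), complemented (complementary), or reverse-complemented (palindromic). The following Python sketch classifies a pair of equal-length repeat units under those definitions. It is an illustration only, not the REPuter algorithm, and `classify_repeat` is a hypothetical name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first, second):
    """Classify how the second repeat copy relates to the first.

    Returns "forward", "reverse", "complement", "palindromic", or None if
    the two strings are not exact repeats of one another. When a pair fits
    more than one category, the first matching category is returned.
    """
    first, second = first.upper(), second.upper()
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    if second == first.translate(COMPLEMENT):
        return "complement"
    if second == first.translate(COMPLEMENT)[::-1]:
        return "palindromic"
    return None


if __name__ == "__main__":
    for other in ("AACG", "GCAA", "TTGC", "CGTT", "GGGG"):
        print(other, classify_repeat("AACG", other))
```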

Comparative genomic analysis
To elucidate the differences among the genomic sequences of C. jubata, C. erinacea, C. opulens, and C. bicolor, we used mVISTA to detect sequence variation, with the sequence of C. bicolor as the reference (Fig 4). The four genomic sequences are highly similar. However, significant differences are found in some intergenic spacer (IGS) regions and partial sequences, such as the IGS between trnK-UUU and rbcL; trnF-GAA and ndhJ; trnL-CAA and trnT-UGU; rpoB and trnC-GCA; petA and psbL; and psbE and petL, as well as the sequences of rpoC, ycf1, and ycf2. The noncoding regions show different degrees of divergence, whereas the protein-coding regions are highly conserved. This finding indicates that the IGS regions of the genus Caragana evolved rapidly. Highly variable regions can be used to develop markers for identification and further phylogenetic study.
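Comparison plots of this kind report percent identity to a reference in windows along an alignment, so divergent regions appear as dips. The Python sketch below computes such a profile for two pre-aligned sequences. It is a simplified illustration rather than the mVISTA procedure; the window and step sizes are illustrative assumptions, and `window_identity` is a hypothetical name.

```python
def window_identity(ref, other, window=100, step=50):
    """Return [(start, percent_identity)] for sliding windows.

    Both sequences must already be aligned to the same length. A column
    counts as a match only if both bases are equal and not a gap.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    ref, other = ref.upper(), other.upper()
    profile = []
    for start in range(0, len(ref) - window + 1, step):
        pairs = zip(ref[start:start + window], other[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile


if __name__ == "__main__":
    reference = "A" * 10 + "C" * 10
    query = "A" * 10 + "G" * 5 + "C" * 5
    for start, identity in window_identity(reference, query, window=10, step=5):
        print(start, identity)
```

Windows with low identity would correspond to candidate variable regions, which could then be checked against the annotation to see whether they fall in intergenic spacers or coding sequences.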

Phylogenetic analyses
To determine the phylogenetic position of the Caragana species, an NJ tree was constructed from 18 complete chloroplast genome sequences of the Fabaceae family (Fig 5). The results showed that the eight Caragana species were closely related and clustered together. The following pairs showed a closer relationship: C. kozlowii and C. erinacea, C. microphylla and C. korshinskii, and C. opulens and C. bicolor. The genera Astragalus and Caragana were classified into the Subtrib. Astragalinae: C. jubata belongs to Ser. Jubatae; C. bicolor belongs to Ser. Occidentales; C. erinacea belongs to Ser. Spinosae, and C. opulens belongs to Ser.

PLOS ONE

Conclusions and discussion
The chloroplast genomes of 10 species of the genus Caragana have been published in the National Center for Biotechnology Information; these chloroplast genomes are between 127,103 and 133,122 bp in size, and they contain 110-111 genes (30-31 tRNA, 4 rRNA, and 76 protein-coding genes). Multiple species of Caragana have been reported to have lost the IR region, such as C. rosea, C. microphylla, and C. intermedia [12][13][14]. The chloroplast genomes of C. jubata, C. erinacea, C. opulens, and C. bicolor showed high similarity with regard to gene deletion, genome size, gene sequences, gene classes, distribution of repeat sequences, and the lack of the IR region. DNA G+C content is an important indicator of species affinity [36], and the four Caragana species in this study have highly similar cpDNA G+C content. IRS analyses of the chloroplast genomes show that the four species lacked complementary repeats. Comparative cp-genomic analysis suggested that the cp genome structure of Caragana was well conserved. Highly variable regions are primarily distributed in non-coding and partial coding regions, which can be used to develop markers for identification and further phylogenetic study.
Intron and/or gene losses in chloroplast genomes have been widely reported in the literature [37][38][39]. Introns can play an important role in the regulation of gene expression in a temporal and tissue-specific manner [39][40][41]. Regulatory mechanisms of introns in some plants and animals have been reported [42][43][44]. However, the relationship between intron deletion
Advances in phylogenetic analysis can reveal the evolution of chloroplast genomes, including nucleotide substitutions and structural changes [45,46]. The results of our phylogenetic analysis were consistent with the status of the major taxa within the genus Caragana [1]. Species from the genus Caragana were monophyletic, and C. jubata, C. erinacea, C. opulens, and C. bicolor could be differentiated from other Caragana species. The current study demonstrated that chloroplast genomes can be used for the identification and classification of Caragana species. In addition, the phylogenetic relationship of Caragana species is related to elevation and geographical distribution (GD). Caragana species have a large altitude span and wide GD, showing strong environmental adaptability [1][2][3][4][5]. Our results can provide valuable information for genetic transformation, the development of population genetic surveys, and evolutionary studies. Plastids contain a range of genes associated with photosynthesis, and photosystem II is a key component in the response to high temperature, drought, and many other stresses [47,48]. However, the mechanism behind the strong environmental adaptability of the genus Caragana remains unclear because of the lack of research and data [3,12]. Our results can provide data for further investigation of genes for adaptability and strong stress resistance in Caragana species. Our research data complement the database of herbgenomics [49,50].